World-wide Covid-19 Data Analyses and Predictions

University of Hawaii at Manoa, ICS 691 Final Project.
Name: Saroj Pathak (Spathak@hawaii.edu)

Introduction:

First detected in Wuhan, China, in December 2019, the 2019 novel coronavirus (the virus that causes Covid-19) spread so readily that within a few months it had caused a worldwide pandemic. It primarily causes respiratory illness in humans. More than a million people across the world have already lost their lives to this pandemic, and thousands more are dying every day. Although the pandemic is not yet over, this project analyzes and visualizes how different countries have been affected by the virus. It also builds time-series predictive models to forecast future confirmed cases and deaths. The data come from Kaggle and GitHub: all Covid-19 data are loaded from the Johns Hopkins CSSE data repository, country-code data are loaded directly from a GitHub page, and population data are obtained from Kaggle.
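As a rough illustration of the data preparation, the sketch below collapses a wide time-series table (one row per province or country, one column per date, as in the Johns Hopkins CSSE time-series files) into one cumulative series per country. The sample rows are illustrative placeholders rather than values taken from the repository, and `country_totals` is a hypothetical helper, not the notebook's own code.

```python
import csv
import io
from collections import defaultdict

# Illustrative sample in the wide layout: four metadata columns, then one column per date.
SAMPLE_CSV = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
,Nepal,28.1667,84.25,0,0,1
Hubei,China,30.9756,112.2707,444,444,549
Beijing,China,40.1824,116.4142,14,22,36
"""


def country_totals(csv_text):
    """Sum provincial rows so that each country has a single cumulative series."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    dates = header[4:]  # the first four columns are metadata, not counts
    totals = defaultdict(lambda: [0] * len(dates))
    for row in reader:
        if not row:
            continue  # skip blank lines
        country = row[1]
        for i, value in enumerate(row[4:]):
            totals[country][i] += int(value)
    return dates, dict(totals)


dates, totals = country_totals(SAMPLE_CSV)
print(dates)
print(totals["China"])
```

With the real files, the same function would be applied to the downloaded CSV text instead of `SAMPLE_CSV`.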

Objectives:

 To visualize the total confirmed cases, total death toll, and total recovered cases across the world on a map.
 To visualize statistics such as (confirmed/population), (death/confirmed), and (death/population) for the latest available data on a country-wise basis.
 To find the top 10 and bottom 10 countries in terms of (confirmed/population), (death/confirmed), and (death/population).
 To perform cluster analysis of the countries based on confirmed cases, (death/confirmed), and (death/population).
 To build a time-series based predictive model for forecasting future confirmed cases and deaths.

Visualization of Latest Covid-19 Data

Confirmed Cases:

Death Toll:

Recovered Cases:

The maps above show only raw counts. Countries with larger populations naturally tend to report more cases than countries with smaller populations, so raw counts alone are not a fair basis for comparison. To compare how severely different countries are affected, the counts should be normalized, for example by population or by the number of confirmed cases.
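The three normalized statistics used in the following sections can be sketched as below. The function names and the example figures are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def percentage(numerator, denominator):
    """Return numerator as a percentage of denominator, or None if undefined."""
    if not denominator:
        return None  # avoid dividing by zero (e.g. no confirmed cases yet)
    return 100.0 * numerator / denominator


def normalized_stats(confirmed, deaths, population):
    """Compute the three ratios used to compare countries."""
    return {
        "confirmed_per_population": percentage(confirmed, population),
        "death_per_confirmed": percentage(deaths, confirmed),
        "death_per_population": percentage(deaths, population),
    }


# Hypothetical country: 1,000,000 people, 50,000 confirmed cases, 1,000 deaths.
stats = normalized_stats(confirmed=50_000, deaths=1_000, population=1_000_000)
print(stats)
```

For this hypothetical country, 5% of the population has a confirmed case, 2% of confirmed cases ended in death, and 0.1% of the population has died.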

Percentage of Confirmed Cases Based on Population

By this measure, South American and European countries appear more affected than Asian and African countries.

Percentage of Death Cases Based on Confirmed Cases

Percentage of Death Cases Based on Population

By this measure as well, South American and European countries appear more affected than Asian and African countries.
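Ranking countries by any of these percentages, as in the top 10 and bottom 10 objective, is a simple selection once the ratios are available. The following is a minimal sketch with placeholder labels and values, not real country data; `top_and_bottom` is a hypothetical helper.

```python
import heapq


def top_and_bottom(values_by_country, n=10):
    """Return the n highest and n lowest (country, value) pairs, skipping missing values."""
    items = [(c, v) for c, v in values_by_country.items() if v is not None]
    top = heapq.nlargest(n, items, key=lambda item: item[1])
    bottom = heapq.nsmallest(n, items, key=lambda item: item[1])
    return top, bottom


# Placeholder labels and percentages, for illustration only.
death_per_confirmed = {"A": 2.5, "B": 0.4, "C": 7.1, "D": None, "E": 1.3}
top, bottom = top_and_bottom(death_per_confirmed, n=2)
print(top)
print(bottom)
```

Countries with an undefined ratio (for example, no confirmed cases) are dropped before ranking so that they do not appear in either list.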

Time-series Based Clustering Based on Confirmed Cases

Time-series Based Clustering Based on Death to Confirmed Percentage

Time-series Based Clustering Based on Death to Population Percentage
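The clustering code itself is not shown in this report, so the following is only a sketch of one plausible approach: rescale each country's series to the range [0, 1] so that clusters reflect the shape of the curve rather than its magnitude, then run plain Euclidean k-means. The curves, the fixed initialization, and all function names are assumptions made for illustration.

```python
def min_max_scale(series):
    """Rescale a series to [0, 1] so clusters reflect curve shape, not magnitude."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0] * len(series)
    return [(x - lo) / (hi - lo) for x in series]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(series_list, k, iterations=20):
    """Plain Euclidean k-means; returns one cluster label per series."""
    # Deterministic initialization: the first k series act as starting centroids.
    centroids = [list(s) for s in series_list[:k]]
    labels = [0] * len(series_list)
    for _ in range(iterations):
        # Assignment step: each series joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(s, centroids[j]))
            for s in series_list
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [s for s, label in zip(series_list, labels) if label == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy cumulative curves: A and B rise early, C and D rise late.
curves = {
    "A": [0, 50, 90, 100, 100],
    "B": [0, 400, 950, 1000, 1000],
    "C": [0, 0, 1, 10, 100],
    "D": [0, 0, 20, 200, 2000],
}
scaled = [min_max_scale(c) for c in curves.values()]
labels = dict(zip(curves, kmeans(scaled, k=2)))
print(labels)
```

On these toy curves, A and B end up in one cluster and C and D in the other, even though B and D are an order of magnitude larger than A and C. The same procedure applies to the (death/confirmed) and (death/population) series.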

Time-series Based Prediction for Confirmed Cases

((300, 1), (30, 1))
C:\Users\saroj\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:159: ValueWarning:

No frequency information was provided, so inferred frequency D will be used.

C:\Users\saroj\anaconda3\lib\site-packages\statsmodels\tsa\holtwinters.py:731: RuntimeWarning:

invalid value encountered in greater_equal

C:\Users\saroj\anaconda3\lib\site-packages\statsmodels\tsa\holtwinters.py:743: ConvergenceWarning:

Optimization failed to converge. Check mle_retvals.

C:\Users\saroj\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:159: ValueWarning:

No frequency information was provided, so inferred frequency D will be used.

C:\Users\saroj\anaconda3\lib\site-packages\statsmodels\tsa\holtwinters.py:725: RuntimeWarning:

invalid value encountered in less_equal

Model performance (Mean Absolute Percentage Error (MAPE)) = 1.9481406496817029 %
Total number of confirmed cases (world-wide) after a month = 93763493
Total number of confirmed cases (world-wide) within next one month = 19553143
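For reference, MAPE averages the absolute error of each prediction relative to the actual value. The sketch below is a generic implementation, not the notebook's own code. It also works through the arithmetic linking the two totals above, on the assumption that the "within next one month" figure is the forecast total minus the latest observed total.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Assumes no actual value is zero."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Toy check: relative errors of 10%, 5%, and 0% average to about 5%.
toy_error = mape([100, 200, 400], [110, 190, 400])
print(toy_error)

# Figures reported above; the subtraction reflects the stated assumption.
forecast_total = 93_763_493
new_cases_next_month = 19_553_143
implied_latest_total = forecast_total - new_cases_next_month
print(implied_latest_total)  # 74210350
```

Under that assumption, the latest observed world-wide total of confirmed cases at the time of the forecast was about 74.2 million.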

Time-series Based Model Prediction for Death Toll

I use 'ExponentialSmoothing' to train and test the model. All data up to one month before the latest update are used for training, and the most recent month of data is used for testing. After the Mean Absolute Percentage Error (MAPE) on the test month has been assessed, the model is used for forecasting: it is refit on all available data and used to forecast one month ahead.
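The workflow described above (hold out the last 30 days, score the model, then refit on everything and forecast 30 days ahead) can be sketched as follows. To keep the example dependency-free, it uses a hand-written Holt linear-trend smoother with fixed parameters as a simplified stand-in for the `ExponentialSmoothing` class in statsmodels, which estimates its parameters by optimization. The synthetic series, the `alpha` and `beta` values, and the function names are assumptions.

```python
def holdout_split(series, test_size=30):
    """Use everything except the last test_size points for training."""
    return series[:-test_size], series[-test_size:]


def holt_forecast(series, horizon, alpha=0.8, beta=0.2):
    """Holt's linear-trend (double exponential) smoothing with fixed parameters."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level, trend = series[0], series[1] - series[0]
    for value in series[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # Extrapolate the final level and trend over the forecast horizon.
    return [level + step * trend for step in range(1, horizon + 1)]


# Synthetic, perfectly linear cumulative series of 330 daily values.
series = [1000.0 + 25.0 * day for day in range(330)]

# Step 1: train on the first 300 days and test on the last 30.
train, test = holdout_split(series, test_size=30)
predicted = holt_forecast(train, horizon=len(test))
error = 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(test, predicted)) / len(test)
print(len(train), len(test), error)

# Step 2: refit on all data and forecast the next 30 days.
future = holt_forecast(series, horizon=30)
print(future[-1])
```

With 330 daily observations, the split gives 300 training points and 30 test points, matching the shapes printed in the output. Real case counts are not linear, so the near-zero error on this synthetic series says nothing about accuracy on actual data.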

((300, 1), (30, 1))
C:\Users\saroj\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:159: ValueWarning:

No frequency information was provided, so inferred frequency D will be used.

C:\Users\saroj\anaconda3\lib\site-packages\statsmodels\tsa\holtwinters.py:743: ConvergenceWarning:

Optimization failed to converge. Check mle_retvals.

C:\Users\saroj\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:159: ValueWarning:

No frequency information was provided, so inferred frequency D will be used.

Model performance (Mean Absolute Percentage Error (MAPE)) = 1.5063783846618073 %
Total number of death cases (world-wide) after a month = 1999767
Total number of death cases (world-wide) within next one month = 350811

Conclusion:

The analyses and visualizations above show that the whole world is suffering from this virus, although the human loss is not the same everywhere. The predictive model developed at the end also projects a large further loss of life within the next month. Hopefully, Covid-19 vaccines will be effective in preventing the predicted loss.